Part-of-speech n-gram and word n-gram fused language model

Authors

  • Hirofumi Yamamoto
  • Yoshinori Sagisaka
Abstract

In this paper, an accurate and compact language model is proposed to cope robustly with data sparseness and task dependencies. This language model adopts new categories which are generated by continuously interpolating POS word-class categories and word categories using MAP estimation. The new categories can reflect word statistics efficiently without losing accuracy, and task-independent general word characteristics (i.e. grammatical constraints captured by POS statistics) are embedded to prevent task over-tuning. This modeling reduces the model size to 50% of the conventional models. The bidirectional word-cluster N-grams generated by this modeling have 3% lower perplexity measured on a matched domain and 15% lower on a mismatched domain compared to a conventional word 2-gram.
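As a rough illustration of the word/class blending the abstract describes, the following Python sketch mixes a word bigram estimate with a POS-class bigram prior in a MAP-flavoured way. This is not the authors' formulation: the function name, the prior weight tau, and the table layout are all illustrative assumptions.

# Minimal sketch (assumption, not the paper's implementation) of blending
# word-level and POS-class-level bigram statistics with a MAP-style prior.
from collections import Counter

def map_interpolated_prob(word, history, word_bigrams, word_unigrams,
                          class_bigrams, pos_of, tau=5.0):
    """Estimate P(word | history) with a POS-class prior.

    The class-level estimate acts as the prior; observed word counts pull
    the estimate toward task-specific statistics as evidence accumulates.
    """
    c_hw = word_bigrams[(history, word)]      # word bigram count
    c_h = word_unigrams[history]              # word history count
    # Prior: P(class(word) | class(history)) * P(word | class(word))
    p_class = class_bigrams.get((pos_of[history], pos_of[word]), 1e-9)
    class_total = sum(c for w, c in word_unigrams.items()
                      if pos_of[w] == pos_of[word])
    prior = p_class * word_unigrams[word] / max(class_total, 1)
    # MAP-style blend: the prior dominates when word evidence is sparse.
    return (c_hw + tau * prior) / (c_h + tau)

# Toy usage with hand-made counts
word_unigrams = Counter({"the": 10, "cat": 3, "dog": 2})
word_bigrams = Counter({("the", "cat"): 2, ("the", "dog"): 1})
pos_of = {"the": "DET", "cat": "NOUN", "dog": "NOUN"}
class_bigrams = {("DET", "NOUN"): 0.8}
print(map_interpolated_prob("cat", "the", word_bigrams, word_unigrams,
                            class_bigrams, pos_of))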

Similar resources

Statistical Language Modeling with pro for Continuous Speech

A new statistical language modeling approach was proposed in which word n-grams were counted separately for the cases crossing and not crossing accent-phrase boundaries. Since such counting requires a large speech corpus, which can hardly be prepared, part-of-speech (POS) n-grams were first counted on a small-sized speech corpus for the two cases instead, and the result was then applied to word n-gram counts o...
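The boundary-dependent counting described in this abstract can be pictured with a short sketch: POS bigrams are accumulated in two separate tables depending on whether the bigram crosses an accent-phrase boundary. The tagged-token format and all names here are assumptions for illustration, not the paper's actual data format.

from collections import Counter

def count_pos_bigrams(utterance):
    """utterance: list of (pos_tag, boundary_follows) pairs, where
    boundary_follows marks an accent-phrase boundary after the token."""
    within, crossing = Counter(), Counter()
    for (pos1, boundary), (pos2, _) in zip(utterance, utterance[1:]):
        (crossing if boundary else within)[(pos1, pos2)] += 1
    return within, crossing

# Toy utterance with a boundary after the particle
utt = [("NOUN", False), ("PARTICLE", True), ("NOUN", False), ("AUX", False)]
within, crossing = count_pos_bigrams(utt)
print(within, crossing)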

Transforming out-of-domain estimates to improve in-domain language models

Standard statistical language modeling techniques suffer from sparse-data problems when applied to real tasks in speech recognition, where large amounts of domain-dependent text are not available. In this work, we introduce a modified representation of the standard word n-gram model using part-of-speech (POS) labels that compensates for word and POS usage differences across domains. Two different ...

Linear Reranking Model for Chinese Pinyin-to-Character Conversion

Pinyin-to-character conversion is an important task in Chinese natural language processing. Previous work mainly focused on n-gram language models and machine learning approaches, or on additional hand-crafted or automatic rule-based post-processing. There are two problems that a word n-gram language model cannot solve: out-of-vocabulary word recognition and long-distance grammatical c...

Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition

This paper describes our research on statistical language modeling of Lithuanian. The idea of improving sparse n-gram models of the highly inflected Lithuanian language by interpolating them with complex n-gram models based on word clustering and morphological word decomposition was investigated. Words, word base forms and part-of-speech tags were clustered into 50 to 5000 automatically generated c...
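The interpolation mentioned in this abstract can be illustrated with a simple linear mixture of a sparse word n-gram estimate and a class-based (cluster or morphology-driven) estimate; the weight lam and all names below are illustrative assumptions, not the paper's setup.

def interpolated_prob(p_word_ngram, p_class_ngram, lam=0.6):
    """Linear interpolation of a word n-gram estimate with a
    class-based n-gram estimate."""
    return lam * p_word_ngram + (1.0 - lam) * p_class_ngram

# Example: a rare word form with weak word-level evidence
print(interpolated_prob(p_word_ngram=0.0004, p_class_ngram=0.002))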

Using Collocations and K-means Clustering to Improve the N-pos Model for Japanese IME

Kana-Kanji conversion is known as one of the representative applications of Natural Language Processing (NLP) for the Japanese language. The N-pos model, which represents the probability of a Kanji candidate sequence as the product of bi-gram Part-of-Speech (POS) probabilities and POS-to-word emission probabilities, has been successfully applied in a number of well-known Japanese Input Method Editor ...
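A compact way to see the N-pos scoring described in this abstract is as a log-probability sum over POS bigram transitions and POS-to-word emissions. The table names and smoothing floor below are illustrative assumptions, not the paper's code.

import math

def npos_log_score(hypothesis, pos_bigram, emission, floor=1e-9):
    """hypothesis: list of (word, pos) pairs for one Kanji candidate sequence."""
    score, prev_pos = 0.0, "<s>"
    for word, pos in hypothesis:
        score += math.log(pos_bigram.get((prev_pos, pos), floor))   # POS bigram
        score += math.log(emission.get((pos, word), floor))         # POS-to-word
        prev_pos = pos
    return score

# The conversion candidate with the highest score would be selected.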

Publication year: 1999